Overview
This competition challenges you to predict which responses users will prefer in a head-to-head battle between chatbots powered by large language models (LLMs). You'll be given a dataset of conversations from the Chatbot Arena, where different LLMs generate answers to user prompts. By developing a winning machine learning model, you'll help improve how chatbots interact with humans and ensure they better align with human preferences.

This is a follow-up to the first Human Preference Prediction competition, which focused on English conversations. This iteration requires working with conversations in many different languages.

This competition was selected for the WSDM Cup 2025. The 18th ACM International Conference on Web Search and Data Mining will take place March 10-14, 2025, in Hannover, Germany.

Large language models (LLMs) are rapidly entering our lives, but ensuring their responses resonate with users is critical for successful interaction. This competition presents a unique opportunity to tackle this challenge with real-world data and help us bridge the gap between LLM capability and human preference.

We utilized a large dataset collected from Chatbot Arena, where users chat with two anonymous LLMs and choose the answer they prefer. Your task in this competition is to predict which response a user will prefer in these head-to-head battles.

This challenge aligns with the concept of "reward models" or "preference models" in reinforcement learning from human feedback (RLHF). Previous research has identified limitations in directly prompting an existing LLM for preference predictions. These limitations often stem from biases in the judging model, such as favoring the response presented first (position bias), favoring longer responses (verbosity bias), or favoring its own outputs (self-enhancement bias).
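One common way to probe a predictor for position bias is to swap the two responses and check whether the prediction follows the response rather than the slot. The sketch below is an illustration only, not a method prescribed by the organizers. It assumes rows are plain dictionaries with the keys `response_a`, `response_b`, and `winner`, and the helper names `swap_sides` and `position_consistency`, as well as the two toy predictors, are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
FLIPPED = {"model_a": "model_b", "model_b": "model_a"}


def swap_sides(row: Row) -> Row:
    """Return a copy of the row with sides A and B exchanged (hypothetical helper)."""
    swapped = dict(row)
    # Exchange the responses and, when present, the model identities.
    for left, right in (("response_a", "response_b"), ("model_a", "model_b")):
        if left in row and right in row:
            swapped[left], swapped[right] = row[right], row[left]
    # The label must follow the response it refers to.
    if "winner" in row:
        swapped["winner"] = FLIPPED[row["winner"]]
    return swapped


def position_consistency(predict: Callable[[Row], str], rows: List[Row]) -> float:
    """Share of rows where the same response is preferred before and after the swap."""
    if not rows:
        raise ValueError("need at least one row")
    consistent = 0
    for row in rows:
        original = predict(row)
        after_swap = predict(swap_sides(row))
        # A position-independent predictor picks the opposite slot after a swap.
        if after_swap == FLIPPED[original]:
            consistent += 1
    return consistent / len(rows)


def always_first(row: Row) -> str:
    """Toy predictor with maximal position bias."""
    return "model_a"


def longer_wins(row: Row) -> str:
    """Toy predictor with pure verbosity bias but no position bias."""
    return "model_a" if len(row["response_a"]) >= len(row["response_b"]) else "model_b"


rows = [
    {"prompt": "Hi", "response_a": "Hello!",
     "response_b": "Hello there, how can I help?", "winner": "model_b"},
    {"prompt": "2+2?", "response_a": "The answer is 4.",
     "response_b": "4", "winner": "model_a"},
]
first_bias_score = position_consistency(always_first, rows)
length_score = position_consistency(longer_wins, rows)
```

The same `swap_sides` helper could also serve as a training-time augmentation, since each swapped row is a valid example with the label flipped.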

We encourage you to explore various machine-learning techniques to build a model that can effectively predict user preferences. Your work will be instrumental in developing LLMs that can tailor responses to individual user preferences, ultimately leading to more user-friendly and widely accepted AI-powered conversation systems.

Evaluation
Submissions will be evaluated based on their categorization accuracy.
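As a rough local stand-in for the leaderboard metric, accuracy can be computed as the fraction of rows where the predicted winner matches the true winner. This is a minimal sketch; the function name `categorization_accuracy` is hypothetical, and the exact scoring code used by the platform is not shown in this description.

```python
from typing import Sequence


def categorization_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of rows where the predicted winner equals the true winner."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of rows")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


truth = ["model_a", "model_b", "model_a", "model_b"]
preds = ["model_a", "model_b", "model_b", "model_b"]
score = categorization_accuracy(truth, preds)  # 3 of the 4 predictions match
```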

Submission File
For each id in the test set, you must predict the target class. The file should contain a header and have the following format:

 id,winner
 123,model_a
 456,model_b
 789,model_a
 etc
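The format above can be produced with the standard `csv` module. The sketch below assumes predictions are available as `(id, winner)` pairs and that `model_a` and `model_b` are the only valid labels, as in the example rows; the function name `build_submission_csv` is hypothetical.

```python
import csv
import io
from typing import Iterable, Tuple

# Assumption: these are the only two target classes, as in the example rows.
VALID_WINNERS = {"model_a", "model_b"}


def build_submission_csv(rows: Iterable[Tuple[int, str]]) -> str:
    """Return CSV text with the required header and one line per (id, winner) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "winner"])
    for row_id, winner in rows:
        if winner not in VALID_WINNERS:
            raise ValueError(f"unexpected winner label for id {row_id}: {winner!r}")
        writer.writerow([row_id, winner])
    return buffer.getvalue()


predictions = [(123, "model_a"), (456, "model_b"), (789, "model_a")]
csv_text = build_submission_csv(predictions)

# In a notebook, the text would then be saved under the required file name:
# with open("submission.csv", "w", encoding="utf-8") as handle:
#     handle.write(csv_text)
```

Validating labels before writing catches typos such as `model_A` locally rather than at scoring time.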


Prizes
1st Place - $12,000
2nd Place - $10,000
3rd Place - $10,000
4th Place - $10,000
5th Place - $8,000

Code Requirements

This is a Code Competition
Submissions to this competition must be made through Notebooks. In order for the "Submit" button to be active after a commit, the following conditions must be met:

CPU Notebook <= 4.75 hours run-time during the training phase, 12 hours during the forecasting phase.
GPU Notebook <= 4.75 hours run-time during the training phase, 12 hours during the forecasting phase.
Internet access disabled
Freely & publicly available external data is allowed, including pre-trained models
Submission file must be named submission.csv
Please see the Code Competition FAQ for more information on how to submit, and review the code debugging doc if you encounter submission errors.

WSDM Cup 2025
This competition was selected for the WSDM Cup 2025. Top submissions will be invited to give talks at the conference. Attending the conference is not required to participate in the competition; however, only teams that attend the conference will be considered to present their work.
Competitions of this form are a crucial part of the research ecosystem, bringing together experts from all over the world to independently evaluate their best ideas on an important problem. Past experience has shown that this process both produces substantial progress on specific problems and helps achieve a depth of validation and understanding that lives up to the highest ideals of empirical rigor for AI and ML as a research field.

Attendees presenting in person are responsible for all costs associated with travel, expenses, and fees to attend WSDM Cup 2025.

Citation
Wei-lin Chiang, Evan Frick, Lisa Dunlap, Anastasios Angelopoulos, Joseph E. Gonzalez, Ion Stoica, Sohier Dane, Maggie Demkin, and Nate Keating. WSDM Cup - Multilingual Chatbot Arena. https://kaggle.com/competitions/wsdm-cup-multilingual-chatbot-arena, 2024. Kaggle.

Dataset Description
The competition dataset consists of user interactions from the Chatbot Arena (formerly LMSYS). In each user interaction, a judge provides one prompt to two different large language models and then indicates which of the models gave the more satisfactory response.

This is a Code Competition. When your submission is scored, the example test data will be replaced with the full test set.

Competition Phases and Data Updates
The competition will proceed in two phases:

A model training phase with a test set of historical data. This test set has about 10,000 rows.
A forecasting phase with a test set to be collected after the submission deadline. You should expect this test set to contain at most 25,000 rows. There may be fewer if that turns out to be necessary in order to maintain the schedule.
The Chatbot Arena team may release additional data during the model training phase.

Files
train.parquet

id - A unique string identifier for the row.
prompt - The prompt that was given as an input to both models.
response_[a/b] - The response from model_[a/b] to the given prompt.
winner - The judge's selection. The ground truth target column.
model_[a/b] - The identity of model_[a/b]. Only included in train.parquet.
language - The language used in the prompt. Only included in train.parquet.

test.parquet

id - A unique integer identifier for the row.
prompt
response_[a/b]
scored - Whether or not the row is currently scored. During the model training phase this will be true for rows used for the public leaderboard; during the forecasting phase this will be true for rows used for the private leaderboard.

sample_submission.csv - A submission file in the correct format.

id
winner

Note that the dataset for this competition contains text that may be considered profane, vulgar, or offensive.
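The column lists in the Files section can be turned into a quick schema check before any modeling starts. The sketch below assumes that `response_[a/b]` and `model_[a/b]` expand to `response_a`, `response_b`, `model_a`, and `model_b`, and that rows are available as plain dictionaries. The names `EXPECTED_COLUMNS`, `missing_columns`, and `scored_rows` are hypothetical. With pandas, the column names would typically come from the `columns` attribute of the frame returned by `pandas.read_parquet`.

```python
from typing import Dict, Iterable, List

# Assumption: the bracket notation in the file description expands to _a and _b.
EXPECTED_COLUMNS: Dict[str, List[str]] = {
    "train.parquet": ["id", "prompt", "response_a", "response_b",
                      "winner", "model_a", "model_b", "language"],
    "test.parquet": ["id", "prompt", "response_a", "response_b", "scored"],
    "sample_submission.csv": ["id", "winner"],
}


def missing_columns(file_name: str, columns: Iterable[str]) -> List[str]:
    """List the expected columns of file_name that are absent from columns."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS[file_name] if name not in present]


def scored_rows(rows: Iterable[dict]) -> List[dict]:
    """Keep only the test rows whose scored flag is true."""
    return [row for row in rows if row["scored"]]


# Toy test rows standing in for the contents of test.parquet.
test_rows = [
    {"id": 123, "prompt": "Hi", "response_a": "Hello!", "response_b": "Hey.", "scored": True},
    {"id": 456, "prompt": "Bye", "response_a": "Goodbye!", "response_b": "See you.", "scored": False},
]
gaps = missing_columns("test.parquet", test_rows[0].keys())
currently_scored = scored_rows(test_rows)
```

Because `model_[a/b]` and `language` are only included in train.parquet, a model that relies on them as inputs would have nothing to read at inference time; a check like this makes that mismatch visible early.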